Introduction: I analyzed the movie dataset, which contains 10866 movies described by 21 features. After checking for errors and outliers and cleaning accordingly, 3653 movies and 16 features remained, which I used to answer the following questions:
How does the median budget compare to the mean budget?
What does the budget distribution look like?
What are the 10 most expensive movies in the dataset?
Are those movies profitable?
What is the relationship between budget and revenue?
What is the trend of movie budgets across the years?
What are the top 10 grossing movies?
Did returns on movies increase over the years?
What are the most successful movies?
Which movies are the least successful?
Who are the most successful directors?
Which companies generated the most revenues in the dataset?
Which movies have the most votes in the dataset?
Which movies are the most popular in the dataset?
Is there a correlation between popularity and revenue generated?
Did heavily budgeted movies turn out popular?
What is the mean rating for all movies based on the voting averages?
What is the minimum number of votes required to be in the top chart at the 95th percentile?
How many movies were released in each month in the dataset?
In which month do movies generate the most revenue?
In which month did movies record the highest profit rate?
Did movie runtime increase over the years?
What are the shortest movies ever produced?
What are the longest movies ever produced?
How correlated are the features measured in the dataset?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import plotly.express as px
from sklearn.metrics.pairwise import linear_kernel
#uploading the moviedataset
moviedata = pd.read_csv("/Users/adedayo/Desktop/ALX-T/tmdb-movies.csv", header = 0, na_values = (' ', 'NULL'))
moviedata.head()
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 137999939.3 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 137999939.3 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 101199955.5 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 183999919.0 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 174799923.1 | 1.385749e+09 |
5 rows × 21 columns
moviedata.shape
(10866, 21)
The dataset includes 10866 movies and 21 features.
moviedata.columns
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
The features in the dataset are: id, imdb_id, popularity, budget, revenue, original_title, cast, homepage, director, tagline, keywords, overview, runtime, genres, production_companies, release_date, vote_count, vote_average, release_year, budget_adj, revenue_adj.
I do not need certain features listed on the dataset, so I will drop them to make my analysis easier and neater
#dropping the columns not useful to my analysis :
moviedata.drop(['imdb_id','homepage', 'budget', 'revenue', 'tagline'],axis=1,inplace=True)
The revenue_adj and budget_adj columns are displayed in scientific notation. Suppressing that notation will make the values easier to read during my analysis.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
I need to check for non-null values and the data types of the columns in the dataset to make sure the columns are ready for analysis
moviedata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 16 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 10866 non-null | int64 |
| 1 | popularity | 10866 non-null | float64 |
| 2 | original_title | 10866 non-null | object |
| 3 | cast | 10790 non-null | object |
| 4 | director | 10822 non-null | object |
| 5 | keywords | 9373 non-null | object |
| 6 | overview | 10862 non-null | object |
| 7 | runtime | 10866 non-null | int64 |
| 8 | genres | 10843 non-null | object |
| 9 | production_companies | 9836 non-null | object |
| 10 | release_date | 10866 non-null | object |
| 11 | vote_count | 10866 non-null | int64 |
| 12 | vote_average | 10866 non-null | float64 |
| 13 | release_year | 10866 non-null | int64 |
| 14 | budget_adj | 10866 non-null | float64 |
| 15 | revenue_adj | 10866 non-null | float64 |

dtypes: float64(4), int64(4), object(8)
memory usage: 1.3+ MB
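The numeric columns have no null values, which is good news, but cast, director, keywords, overview, genres and production_companies each have fewer than 10866 non-null entries, so I will handle those missing values further down. In addition, a few columns have the wrong data types ascribed to them. I will modify them accordingly.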
#converting release_year and id to object :
moviedata['release_year'] = moviedata['release_year'].astype(object)
moviedata['id'] = moviedata['id'].astype(object)
#Renaming some of the columns to ease my analysis:
moviedata.rename(columns={'budget_adj': 'budget',
'revenue_adj': 'revenue',
'original_title': 'title'
},
inplace=True, errors='raise')
Next, I will count the number of unique values in the dataset
moviedata.nunique(axis = 0)
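| Column | Unique values |
|---|---|
| id | 10865 |
| popularity | 10814 |
| title | 10571 |
| cast | 10719 |
| director | 5067 |
| keywords | 8804 |
| overview | 10847 |
| runtime | 247 |
| genres | 2039 |
| production_companies | 7445 |
| release_date | 5909 |
| vote_count | 1289 |
| vote_average | 72 |
| release_year | 56 |
| budget | 2614 |
| revenue | 4840 |

dtype: int64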
Next, I will check for possible duplication of values
moviedata.duplicated().value_counts()
False 10865 True 1 dtype: int64
There is one duplicated row in the dataset, so I will drop it.
moviedata = moviedata.drop_duplicates()
Checking for null values in the dataset
moviedata.isnull().sum()
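| Column | Null count |
|---|---|
| id | 0 |
| popularity | 0 |
| title | 0 |
| cast | 76 |
| director | 44 |
| keywords | 1493 |
| overview | 4 |
| runtime | 0 |
| genres | 23 |
| production_companies | 1030 |
| release_date | 0 |
| vote_count | 0 |
| vote_average | 0 |
| release_year | 0 |
| budget | 0 |
| revenue | 0 |

dtype: int64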
#dropping the null values:
moviedata.dropna(inplace=True)
Checking for summary stats to find out possible outliers
moviedata.describe()
| | popularity | runtime | vote_count | vote_average | budget | revenue |
|---|---|---|---|---|---|---|
| count | 8666.00 | 8666.00 | 8666.00 | 8666.00 | 8666.00 | 8666.00 |
| mean | 0.74 | 103.82 | 264.20 | 6.00 | 21307490.97 | 63624403.82 |
| std | 1.09 | 26.21 | 635.20 | 0.89 | 37102690.20 | 159287366.71 |
| min | 0.00 | 0.00 | 10.00 | 1.50 | 0.00 | 0.00 |
| 25% | 0.25 | 91.00 | 20.00 | 5.50 | 0.00 | 0.00 |
| 50% | 0.45 | 100.00 | 54.00 | 6.10 | 2130702.68 | 189123.75 |
| 75% | 0.84 | 113.00 | 200.00 | 6.60 | 28104657.57 | 55248572.50 |
| max | 32.99 | 705.00 | 9767.00 | 8.70 | 425000000.00 | 2827123750.00 |
The summary statistics above show that there are zero values in the budget and revenue columns, which is not possible in reality. The zero values most likely come from poor data collection. To improve the reliability of my analysis, I will use 7,000 dollars as the minimum budget and 30 dollars as the minimum revenue. According to Google, these are the lowest movie budget and revenue values ever recorded.
moviedata = moviedata[moviedata['budget'] > 6999]
moviedata = moviedata[moviedata['revenue'] > 29]
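Before applying cutoffs like these, it can help to count how many rows each filter removes. The following is a minimal sketch on made-up data; the frame `toy` and the constants `MIN_BUDGET` and `MIN_REVENUE` are illustrative names and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for moviedata; the values are made up for illustration.
toy = pd.DataFrame({
    "budget":  [0.0, 5_000.0, 7_000.0, 1_000_000.0, 0.0],
    "revenue": [0.0, 100.0, 29.0, 5_000_000.0, 2_000.0],
})

MIN_BUDGET = 7_000   # matches `budget > 6999` for whole-dollar values
MIN_REVENUE = 30     # matches `revenue > 29` for whole-dollar values

# A row is kept only if it passes both cutoffs.
keep = (toy["budget"] >= MIN_BUDGET) & (toy["revenue"] >= MIN_REVENUE)
removed = int((~keep).sum())
filtered = toy[keep]

print(f"rows removed: {removed}, rows kept: {len(filtered)}")
```

Reporting the number of removed rows next to the cutoffs makes it easier to judge how much of the dataset the cleaning step discards.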
Checking the final copy of the dataset before the exploratory analysis
moviedata
| | id | popularity | title | cast | director | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.99 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.50 | 2015 | 137999939.30 | 1392445893.00 |
| 1 | 76341 | 28.42 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.10 | 2015 | 137999939.30 | 348161292.50 |
| 2 | 262500 | 13.11 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | Robert Schwentke | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.30 | 2015 | 101199955.50 | 271619025.40 |
| 3 | 140607 | 11.17 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | J.J. Abrams | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.50 | 2015 | 183999919.00 | 1902723130.00 |
| 4 | 168259 | 9.34 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | James Wan | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.30 | 2015 | 174799923.10 | 1385748801.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10822 | 396 | 0.67 | Who's Afraid of Virginia Woolf? | Elizabeth Taylor|Richard Burton|George Segal|S... | Mike Nichols | alcohol|adultery|professor|married couple|son | Mike Nichols’ film from Edward Albee's play ... | 131 | Drama | Chenault Productions | 6/21/66 | 74 | 7.50 | 1966 | 50385110.19 | 226643572.40 |
| 10828 | 5780 | 0.40 | Torn Curtain | Paul Newman|Julie Andrews|Lila Kedrova|Hansjö... | Alfred Hitchcock | cold war|east germany | An American scientist publicly defects to East... | 128 | Mystery|Thriller | Universal Pictures | 7/13/66 | 46 | 6.30 | 1966 | 20154044.08 | 87334191.00 |
| 10829 | 6644 | 0.40 | El Dorado | John Wayne|Robert Mitchum|James Caan|Charlene ... | Howard Hawks | sheriff|ranch|liquor|settler|revolver | Cole Thornton, a gunfighter for hire, joins fo... | 120 | Action|Western | Paramount Pictures|Laurel Productions | 12/17/66 | 36 | 6.90 | 1966 | 31258922.36 | 40308088.15 |
| 10835 | 5923 | 0.30 | The Sand Pebbles | Steve McQueen|Richard Attenborough|Richard Cre... | Robert Wise | missionary|china|us navy|chinese|battle | Engineer Jake Holman arrives aboard the gunboa... | 182 | Action|Adventure|Drama|War|Romance | Twentieth Century Fox Film Corporation|Solar P... | 12/20/66 | 28 | 7.00 | 1966 | 80616176.31 | 134360293.80 |
| 10848 | 2161 | 0.21 | Fantastic Voyage | Stephen Boyd|Raquel Welch|Edmond O'Brien|Donal... | Richard Fleischer | submarine|coma|claustrophobia|wound|laser | The science of miniaturization has been unlock... | 100 | Adventure|Science Fiction | Twentieth Century Fox Film Corporation | 8/24/66 | 42 | 6.70 | 1966 | 34362645.15 | 80616176.31 |
3653 rows × 16 columns
moviedata.describe()
| | popularity | runtime | vote_count | vote_average | budget | revenue |
|---|---|---|---|---|---|---|
| count | 3653.00 | 3653.00 | 3653.00 | 3653.00 | 3653.00 | 3653.00 |
| mean | 1.23 | 109.60 | 550.73 | 6.19 | 45652832.34 | 142955209.94 |
| std | 1.50 | 19.86 | 897.05 | 0.79 | 45214412.13 | 219943362.87 |
| min | 0.01 | 26.00 | 10.00 | 2.20 | 7755.18 | 48.38 |
| 25% | 0.48 | 96.00 | 79.00 | 5.70 | 14040500.29 | 20848013.80 |
| 50% | 0.84 | 106.00 | 223.00 | 6.20 | 31341561.28 | 66367636.59 |
| 75% | 1.41 | 120.00 | 600.00 | 6.70 | 62160970.18 | 171410264.00 |
| max | 32.99 | 338.00 | 9767.00 | 8.40 | 425000000.00 | 2827123750.00 |
From the above output, the dataset has been reduced to 3653 movies and 16 columns (features).
My exploratory analysis will be based on this!
## Plotting the continuous features against the number of movies to understand how the different features compare quantitatively:
plt.figure(figsize = (20,15))
plt.subplot(241)
plt.hist(moviedata.popularity, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of popularity")
plt.xlabel("Popularity")
plt.ylabel("Number of movies")
plt.subplot(242)
plt.hist(moviedata.runtime, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of Runtime")
plt.xlabel("Runtime")
plt.ylabel("Number of movies")
plt.subplot(243)
plt.hist(moviedata.vote_average, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of Vote average")
plt.xlabel("Vote average")
plt.ylabel("Number of movies")
plt.subplot(244)
plt.hist(moviedata.vote_count, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of Vote count")
plt.xlabel("Vote count")
plt.ylabel("Number of movies")
plt.subplot(245)
plt.hist(moviedata.budget, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of movie budget")
plt.xlabel("Movie Budget")
plt.ylabel("Number of movies")
plt.subplot(246)
plt.hist(moviedata.revenue, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of revenue of movies")
plt.xlabel("Movie revenue")
plt.ylabel("Number of movies")
plt.subplot(247)
plt.hist(moviedata.release_year, bins=10, linewidth=0.5, edgecolor="white")
plt.title("Proportion of Movies released per year")
plt.xlabel("Years of release")
plt.ylabel("Number of movies")
Text(0, 0.5, 'Number of movies')
moviedata.budget.describe()
count 3653.00 mean 45652832.34 std 45214412.13 min 7755.18 25% 14040500.29 50% 31341561.28 75% 62160970.18 max 425000000.00 Name: budget, dtype: float64
A mean budget of 45,652,832.34 dollars versus a median budget of 31,341,561.28 dollars suggests that the outliers in the dataset are very influential. Because the mean is larger than the median, the distribution is skewed to the right. Thus the median may be the best measure of central tendency.
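The comparison between mean and median can be backed up with a skewness coefficient. This is a small sketch on made-up numbers (the series `budgets` is illustrative, not the notebook's data): a positive value from `Series.skew()` indicates a longer right tail, which is the situation where the mean sits above the median.

```python
import pandas as pd

# Made-up budgets: mostly small values plus one very large one.
budgets = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0])

mean_budget = budgets.mean()
median_budget = budgets.median()
skewness = budgets.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_budget:.2f}, median={median_budget:.2f}, skew={skewness:.2f}")
```

On the real data the same check would be `moviedata['budget'].skew()`.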
Next, I will visualize the distribution of the movie budgets
plt.figure(figsize=(8,8))
plt.hist(moviedata.budget,30)
plt.xlabel('Movie Budget')
plt.ylabel('Number of Movies')
plt.title('Movie budget in hundred of millions')
Text(0.5, 1.0, 'Movie budget in hundred of millions')
The number of movies falls off steeply as budgets grow. According to the summary statistics, half of the movies have a budget below about 31 million dollars and 75% have a budget below about 62 million dollars.
Next, I will take a look at the 10 most expensive movies in the dataset, their release year, the revenue and the profit rate they generated for the producing companies
# adding the profit rate column to the dataset using the budget and revenue columns:
moviedata['profit_rate'] = moviedata['revenue'] / moviedata['budget']
moviedata[moviedata['budget'].notnull()][['title','release_year',
'budget', 'revenue', 'profit_rate']].sort_values('budget',
ascending=False,ignore_index=True).head(10)
| | title | release_year | budget | revenue | profit_rate |
|---|---|---|---|---|---|
| 0 | The Warrior's Way | 2010 | 425000000.00 | 11087569.00 | 0.03 |
| 1 | Pirates of the Caribbean: On Stranger Tides | 2011 | 368371256.20 | 990417500.30 | 2.69 |
| 2 | Pirates of the Caribbean: At World's End | 2007 | 315500574.80 | 1010653508.00 | 3.20 |
| 3 | Superman Returns | 2006 | 292050672.70 | 423020463.80 | 1.45 |
| 4 | Titanic | 1997 | 271692064.20 | 2506405735.00 | 9.23 |
| 5 | Spider-Man 3 | 2007 | 271330494.30 | 936901700.20 | 3.45 |
| 6 | Tangled | 2010 | 260000000.00 | 591794936.00 | 2.28 |
| 7 | Avengers: Age of Ultron | 2015 | 257599886.70 | 1292632337.00 | 5.02 |
| 8 | Harry Potter and the Half-Blood Prince | 2009 | 254100108.50 | 949276533.30 | 3.74 |
| 9 | Waterworld | 1995 | 250419201.70 | 378087518.50 | 1.51 |
From the above, The Warrior's Way is the most expensive movie in the dataset. Nine of the ten movies on the list earned more than they cost, with Titanic making about nine times the amount invested in producing it. The exception is The Warrior's Way itself: the most expensive movie on the list earned back only about 3% of its budget, suggesting that a high budget does not always translate to high profitability.
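However, it is important to note that the budget and revenue figures used here come from the budget_adj and revenue_adj columns, which are adjusted for inflation, so the list may differ from a ranking by nominal budgets.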
Next, I will visualize the most expensive movies list on a bar chart:
cols = ['title', 'budget']
budget = moviedata.sort_values('budget', ascending=False)[cols].set_index('title')
mostexpensive = budget.head(10)
sns.barplot(data=mostexpensive, x=mostexpensive.index, y='budget')
plt.xticks(ha='left', rotation=-20, fontsize=10)
plt.yticks(fontsize=10)
plt.title('Top 10 Most Expensive Movies', fontsize=10)
Text(0.5, 1.0, 'Top 10 Most Expensive Movies')
Next, I will visualize the relationship between the revenue and budget on a scatter plot
rev_budg = px.scatter(moviedata, x="budget", y="revenue", trendline="ols",
title="Relationship between Budget and Revenue")
rev_budg.update_layout(xaxis_title="Budget",
yaxis_title="Revenue")
rev_budg.show()
moviedata['budget'].corr(moviedata['revenue'])
0.5636917736722413
From the correlation coefficient above, it is clear that budget is moderately correlated with revenue.
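For reference, the Pearson coefficient that `Series.corr` returns by default is the sum of the products of the deviations from the mean, divided by the square root of the product of the summed squared deviations. The sketch below checks that definition against pandas on made-up numbers; `budget` and `revenue` here are toy series, not the notebook's columns.

```python
import numpy as np
import pandas as pd

# Made-up paired observations.
budget = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = pd.Series([2.0, 1.0, 4.0, 3.0, 6.0])

# Pearson r from its definition: co-deviation over the product of spreads.
dx = budget - budget.mean()
dy = revenue - revenue.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = budget.corr(revenue)

print(r_manual, r_pandas)
```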
# visualizing the highest movie budget in each year:
plt.figure(figsize=(15,10))
yearlybudg = moviedata[(moviedata['budget'].notnull()) & (moviedata['release_year'] != 'NaT')].groupby('release_year')['budget'].max()
plt.plot(yearlybudg.index, yearlybudg)
plt.xticks(np.arange(1960, 2020, 10.0))
plt.xlabel('Year')
plt.ylabel('Movie Budgets in hundreds of millions')
plt.title('Movie Budgets over the years')
plt.show()
As seen in the graph, the highest movie budget in each year has increased over time, with the most expensive movie breaking the 400 million dollar mark in 2010. Interestingly, there was a sharp decrease after that, but the trend seems to be picking up again by 2015.
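Because the plot uses the yearly maximum, a single very expensive movie can dominate a year. A hedged sketch of how the maximum and the median could be compared side by side, using a made-up frame `toy` rather than the notebook's data:

```python
import pandas as pd

# Made-up data: 2010 contains one extreme budget.
toy = pd.DataFrame({
    "release_year": [2009, 2009, 2009, 2010, 2010, 2010],
    "budget": [10.0, 20.0, 30.0, 5.0, 15.0, 400.0],
})

# One row per year, with the largest and the typical budget as columns.
yearly = toy.groupby("release_year")["budget"].agg(["max", "median"])
print(yearly)
```

In this toy example the maximum jumps in 2010 while the median actually falls, which is why the two series can tell different stories about the trend.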
I will start by looking at the top 10 grossing movies in the dataset
top_gross = moviedata[['title', 'release_year',
'budget', 'revenue']].sort_values('revenue', ascending=False,ignore_index=True).head(10)
pd.set_option('display.max_colwidth', 100)
top_gross
| | title | release_year | budget | revenue |
|---|---|---|---|---|
| 0 | Avatar | 2009 | 240886902.90 | 2827123750.00 |
| 1 | Star Wars | 1977 | 39575591.36 | 2789712242.00 |
| 2 | Titanic | 1997 | 271692064.20 | 2506405735.00 |
| 3 | The Exorcist | 1973 | 39289276.63 | 2167324901.00 |
| 4 | Jaws | 1975 | 28362748.20 | 1907005842.00 |
| 5 | Star Wars: The Force Awakens | 2015 | 183999919.00 | 1902723130.00 |
| 6 | E.T. the Extra-Terrestrial | 1982 | 23726245.23 | 1791694309.00 |
| 7 | The Net | 1995 | 31481271.08 | 1583049536.00 |
| 8 | One Hundred and One Dalmatians | 1961 | 29179444.83 | 1574814740.00 |
| 9 | The Avengers | 2012 | 208943741.90 | 1443191435.00 |
Interestingly, seven of the ten highest grossing movies in the dataset were released before 2000. This suggests that movie revenues may have dropped over time.
To have a better understanding of this, I will visualize the relationship between the profit_rate and the release_year.
plt.figure(figsize=(15,10))
year_prof = moviedata[(moviedata['profit_rate'].notnull()) & (moviedata['release_year'] != 'NaT')].groupby('release_year')['profit_rate'].max()
plt.plot(year_prof.index, year_prof)
plt.xticks(np.arange(1960, 2020, 10.0))
plt.xlabel('Year')
plt.ylabel('Movie Profit Rate')
plt.title('Movie Profit Rates over the years')
plt.show()
The highest profit rates seem to come from the 2000s, in contrast to the top grossing list, which is dominated by pre-2000 releases.
Next, I will plot the highest movie revenues through the years.
plt.figure(figsize=(15,10))
yearly_rev = moviedata[(moviedata['revenue'].notnull()) & (moviedata['release_year'] != 'NaT')].groupby('release_year')['revenue'].max()
plt.plot(yearly_rev.index, yearly_rev)
plt.xticks(np.arange(1960, 2020, 10.0))
plt.xlabel('Year')
plt.ylabel('Movie Revenue in billions')
plt.title('Movie Revenue over the years')
plt.show()
As seen from the graph above, the highest yearly movie revenue has risen and fallen over the years. Interestingly, the first movie to break the 2.5 billion dollar mark was released in the late 70s (Star Wars, 1977). According to the top 10 table, only Titanic (1997) crossed that mark again before Avatar surpassed it in 2009.
Which movies are the most successful in the dataset?
To do this, I will find the movies with the highest profit rates in the dataset.
# keep rows with a defined profit rate and a positive budget, then rank by profit rate
moviedata[moviedata['profit_rate'].notnull() & (moviedata['budget'] > 0)][
    ['title', 'release_year', 'budget', 'revenue', 'profit_rate']
].sort_values('profit_rate', ascending=False, ignore_index=True).head(10)
| | title | release_year | budget | revenue | profit_rate |
|---|---|---|---|---|---|
| 0 | Paranormal Activity | 2007 | 15775.03 | 203346220.10 | 12890.39 |
| 1 | The Blair Witch Project | 1999 | 32726.32 | 324645106.00 | 9920.00 |
| 2 | Eraserhead | 1977 | 35977.81 | 25184467.23 | 700.00 |
| 3 | Pink Flamingos | 1972 | 62574.73 | 31287365.59 | 500.00 |
| 4 | Super Size Me | 2004 | 75038.95 | 32988367.35 | 439.62 |
| 5 | The Gallows | 2015 | 91999.96 | 39251239.93 | 426.64 |
| 6 | Open Water | 2004 | 150077.90 | 63111168.01 | 420.52 |
| 7 | The Texas Chain Saw Massacre | 1974 | 375894.13 | 136467258.50 | 363.05 |
| 8 | Mad Max | 1979 | 1201821.60 | 300455400.30 | 250.00 |
| 9 | Halloween | 1978 | 1002810.21 | 233989048.60 | 233.33 |
With revenue of more than 12,000 times its budget, Paranormal Activity is the most successful movie in the dataset. Interestingly, none of the movies on the list cost more than 2 million dollars to produce.
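Since profit_rate is defined as revenue divided by budget, it is a ratio rather than a percentage. A minimal sketch of the conversion to a percentage return follows; the helper `ratio_to_percent_return` is a hypothetical name introduced only for this illustration, and the Paranormal Activity figures are copied from the table above.

```python
def ratio_to_percent_return(revenue: float, budget: float) -> float:
    """Percentage gain or loss relative to the budget."""
    return (revenue / budget - 1.0) * 100.0

# A movie that doubles its budget has a ratio of 2 and a return of 100%.
doubled = ratio_to_percent_return(200.0, 100.0)

# A movie that earns half its budget has a ratio of 0.5 and a return of -50%.
halved = ratio_to_percent_return(50.0, 100.0)

# Budget and revenue of Paranormal Activity as shown in the table.
pa_ratio = 203346220.10 / 15775.03
pa_return = ratio_to_percent_return(203346220.10, 15775.03)

print(doubled, halved, round(pa_ratio, 2), round(pa_return))
```

Read this way, a ratio below 1 means the movie did not earn back its budget, and a ratio of about 12,890 corresponds to a return of well over a million percent.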
Which movies are the least successful in the dataset?
To do this, I will find the movies with the lowest profit rates in the dataset
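# keep rows with a defined profit rate and a positive budget, then rank from the lowest profit rate
moviedata[moviedata['profit_rate'].notnull() & (moviedata['budget'] > 0)][
    ['title', 'release_year', 'budget', 'revenue', 'profit_rate']
].sort_values('profit_rate', ascending=True, ignore_index=True).head(10)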
| | title | release_year | budget | revenue | profit_rate |
|---|---|---|---|---|---|
| 0 | Charlotte's Web | 2006 | 91941878.45 | 155.76 | 0.00 |
| 1 | Brother Bear | 2003 | 118535264.50 | 296.34 | 0.00 |
| 2 | Teenage Mutant Ninja Turtles II: The Secret of the Ooze | 1991 | 40027321.05 | 124.89 | 0.00 |
| 3 | Death at a Funeral | 2007 | 9465017.24 | 48.38 | 0.00 |
| 4 | Death Defying Acts | 2007 | 21033371.65 | 3744.99 | 0.00 |
| 5 | The Samaritan | 2012 | 11396931.38 | 2394.31 | 0.00 |
| 6 | The Adventurer: The Curse of the Midas Box | 2013 | 23400833.82 | 5989.68 | 0.00 |
| 7 | Chaos | 2005 | 13398759.48 | 11488.32 | 0.00 |
| 8 | 5 Days of War | 2011 | 19387960.85 | 16944.11 | 0.00 |
| 9 | Sweetwater | 2013 | 6552233.47 | 5753.80 | 0.00 |
Charlotte's Web, released in 2006 with a budget of over 91 million dollars, is the least successful movie in the dataset by profit rate. Interestingly, none of the movies on the list cost less than 6 million dollars, suggesting that a heavy production budget is no guarantee against failure.
Who are the most successful directors in the dataset?
To answer this, I will take a look at the directors who directed the movies with the highest profit rates.
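# count movies per director and keep only directors with at least `min_movies` movies
min_movies = 10  # assumed cutoff for illustration; the notebook does not state one
director_counts = moviedata.groupby('director')['revenue'].count().sort_values(ascending=False)
director_mov = list(director_counts[director_counts >= min_movies].index)
top_director = moviedata[moviedata['profit_rate'].notnull()
                         & (moviedata['revenue'] > 0)
                         & moviedata['director'].isin(director_mov)]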
pd.DataFrame(top_director.groupby('director')['profit_rate'].mean().sort_values(ascending=False).head(10))
| director | profit_rate |
|---|---|
| John Carpenter | 22.99 |
| Wes Craven | 11.56 |
| Robert Zemeckis | 5.82 |
| Francis Ford Coppola | 5.07 |
| Richard Donner | 4.90 |
| Clint Eastwood | 4.74 |
| Woody Allen | 4.37 |
| Tony Scott | 4.29 |
| Michael Bay | 3.92 |
| Oliver Stone | 3.75 |
John Carpenter is the most successful director in the dataset.
Next, I will take a look at the movies he directed, the year he directed them and the profit rate for each movie
moviedata[(moviedata['director'] == 'John Carpenter')&
(moviedata['profit_rate'].notnull())][['title', 'budget', 'revenue', 'profit_rate', 'release_year']]
| | title | budget | revenue | profit_rate | release_year |
|---|---|---|---|---|---|
| 2048 | The Ward | 10000000.00 | 498974.00 | 0.05 | 2010 |
| 2742 | Ghosts of Mars | 34481667.27 | 17254173.11 | 0.50 | 2001 |
| 7324 | The Fog | 2646036.75 | 56567928.92 | 21.38 | 1980 |
| 7906 | Starman | 46178924.86 | 60335611.63 | 1.31 | 1984 |
| 8315 | Memoirs of an Invisible Man | 62160970.18 | 22312731.53 | 0.36 | 1992 |
| 8384 | Escape from New York | 14389144.83 | 119909540.20 | 8.33 | 1981 |
| 8494 | Escape from L.A. | 69510836.95 | 58774700.50 | 0.85 | 1996 |
| 8890 | The Thing | 22596424.03 | 31144285.17 | 1.38 | 1982 |
| 9484 | They Live | 7375564.09 | 23987045.56 | 3.25 | 1988 |
| 9651 | Prince of Darkness | 5757241.31 | 27217342.96 | 4.73 | 1987 |
| 10489 | Big Trouble in Little China | 49735160.27 | 21883470.52 | 0.44 | 1986 |
| 10759 | Halloween | 1002810.21 | 233989048.60 | 233.33 | 1978 |
Interestingly, John Carpenter's most successful movies to date are Halloween and The Fog, the two earliest of his movies in the dataset, released in 1978 and 1980 respectively. His lowest return came from The Ward (2010), the most recent of his movies in the dataset, which may point to declining returns on his later work. Notably, his most successful movies were also the ones that cost the least.
Which director brought in the highest average revenue in the dataset?
plt.title("Directors with Highest Average Revenue")
moviedata[moviedata['director'].isin(director_mov)
].groupby('director')['revenue'].mean().sort_values(ascending=False).head(10).plot(kind='bar', colormap='Blues_r')
plt.show()
Michael Bay is the go-to director for anyone looking to get the highest gross on their movies!
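# mean budget per director, keeping the five directors with the highest averages
direc_spent = moviedata.groupby('director')['budget'].mean().sort_values(ascending=False).head()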
direc_spent.plot(kind = 'pie', autopct = '%1.1f%%', figsize = (8,8), fontsize = 12)
plt.tight_layout()
plt.title('Highest Budget by Directors', fontsize = 15)
Text(0.5, 1.0, 'Highest Budget by Directors')
Which companies generated the most revenues in the dataset?
To answer this, I will only consider those companies that have made at least 20 movies.
comp = moviedata.apply(lambda x: pd.Series(x['production_companies']),axis=1).stack().reset_index(level=1, drop=True)
comp.name = 'companies'
comp_data = moviedata.drop('production_companies', axis=1).join(comp)
# grouping the company columns:
comp_sum = pd.DataFrame(comp_data.groupby('companies')['revenue'].sum().sort_values(ascending=False))
comp_sum.columns = ['Total']
comp_av = pd.DataFrame(comp_data.groupby('companies')['revenue'].mean().sort_values(ascending=False))
comp_av.columns = ['Average']
comp_count = pd.DataFrame(comp_data.groupby('companies')['revenue'].count().sort_values(ascending=False))
comp_count.columns = ['Number']
comp_agg = pd.concat((comp_sum, comp_av, comp_count), axis=1)
comp_agg[comp_agg['Number'] >= 20].sort_values('Number', ascending=False).head()
| companies | Total | Average | Number |
|---|---|---|---|
| Paramount Pictures | 12185100969.85 | 162468012.93 | 75 |
| Universal Pictures | 6157750245.07 | 109959825.80 | 56 |
| Columbia Pictures | 6103154562.06 | 156491142.62 | 39 |
| New Line Cinema | 2821884350.80 | 78385676.41 | 36 |
| Warner Bros. | 3827928035.61 | 119622751.11 | 32 |
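Based on the output, Paramount Pictures generated the most revenue among the production companies shown. As expected, they produced the most movies too. Note that production_companies is a pipe-separated string and the code above does not split it, so each row counts only the movies credited to that exact string, i.e., to that company alone.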
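If co-productions should count toward every company involved, one option is to split the pipe-separated string and give each company its own row before grouping. The following is a sketch under that assumption, on a made-up frame `toy` with hypothetical studio names; it is not the aggregation that produced the table above.

```python
import pandas as pd

# Made-up movies; companies are pipe-separated as in the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "production_companies": ["Studio X|Studio Y", "Studio X", "Studio Y|Studio Z"],
    "revenue": [100.0, 50.0, 30.0],
})

# Split each string into a list, then expand to one row per (movie, company).
exploded = toy.assign(
    companies=toy["production_companies"].str.split("|")
).explode("companies")

# Total, average and number of movies per individual company.
per_company = exploded.groupby("companies")["revenue"].agg(["sum", "mean", "count"])
print(per_company)
```

With this approach a movie's full revenue is attributed to each of its co-producers, so the per-company totals add up to more than the overall revenue.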
Most voted movies
Which movies have the most votes in the dataset?
moviedata[moviedata['vote_count'].notnull()
][['title','revenue', 'release_year','vote_count']
].sort_values('vote_count', ascending=False, ignore_index=True).head(10)
| | title | revenue | release_year | vote_count |
|---|---|---|---|---|
| 0 | Inception | 825500000.00 | 2010 | 9767 |
| 1 | The Avengers | 1443191435.00 | 2012 | 8903 |
| 2 | Avatar | 2827123750.00 | 2009 | 8458 |
| 3 | The Dark Knight | 1014733032.00 | 2008 | 8432 |
| 4 | Django Unchained | 403991051.50 | 2012 | 7375 |
| 5 | The Hunger Games | 656473401.90 | 2012 | 7080 |
| 6 | Iron Man 3 | 1137692373.00 | 2013 | 6882 |
| 7 | The Dark Knight Rises | 1026712780.00 | 2012 | 6723 |
| 8 | Interstellar | 572690645.10 | 2014 | 6498 |
| 9 | The Hobbit: An Unexpected Journey | 965893322.80 | 2012 | 6417 |
The movies released in the late 2000s and in the 2010s have the most votes. This is expected given that the period coincides with the global increase in social media and internet usage, which makes it easier for viewers to see movies.
#Visualization of Top 10 most voted movies:
# rank by vote_count (not popularity) so the chart matches its title
cols = ['title', 'vote_count']
movievote = moviedata.sort_values('vote_count', ascending=False)[cols].set_index('title')
Top10vote = movievote.head(10)
sns.barplot(data=Top10vote, x=Top10vote.index, y='vote_count')
plt.xticks(ha='left', rotation=-20, fontsize=10)
plt.yticks(fontsize=10)
plt.title('Top 10 most voted movies', fontsize=10)
Text(0.5, 1.0, 'Top 10 most voted movies')
Which movies are the most popular in the dataset?
moviedata[moviedata['popularity'].notnull()][['title','popularity','release_year']
].sort_values('popularity',ascending=False,ignore_index=True).head(10)
| | title | popularity | release_year |
|---|---|---|---|
| 0 | Jurassic World | 32.99 | 2015 |
| 1 | Mad Max: Fury Road | 28.42 | 2015 |
| 2 | Interstellar | 24.95 | 2014 |
| 3 | Guardians of the Galaxy | 14.31 | 2014 |
| 4 | Insurgent | 13.11 | 2015 |
| 5 | Captain America: The Winter Soldier | 12.97 | 2014 |
| 6 | Star Wars | 12.04 | 1977 |
| 7 | John Wick | 11.42 | 2014 |
| 8 | Star Wars: The Force Awakens | 11.17 | 2015 |
| 9 | The Hunger Games: Mockingjay - Part 1 | 10.74 | 2014 |
Nine of the ten most popular movies in the dataset were released in the 2010s. Interestingly, the only exception on the list is Star Wars, released in 1977. That is a big deal given how many movies have been released since then.
#visualization of the most popular movies
pop_movies = moviedata.sort_values('popularity', ascending = False)
plt.figure(figsize = (15,10))
plt.barh(pop_movies['title'].head(6),pop_movies['popularity'].head(6), align = 'center', color = 'green')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")
Text(0.5, 1.0, 'Popular Movies')
What is the relationship between popularity and revenue generated?
pop_rev = px.scatter(moviedata, x="popularity", y="revenue", trendline="ols",
title="Relationship between Popularity and Revenue")
pop_rev.update_layout(xaxis_title="Popularity",
yaxis_title="Revenue")
pop_rev.show()
An R-squared value of 0.29 means that 29% of the variance in movie revenue can be explained by the popularity of the movie.
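For a straight-line fit with a single predictor, R-squared is simply the square of the Pearson correlation coefficient, which is why the popularity/revenue correlation of about 0.54 reported in the correlation table corresponds to an R-squared of about 0.29. The sketch below illustrates this on made-up numbers; the arrays are hypothetical and are not taken from the movie dataset.

```python
import numpy as np

# Hypothetical toy values, NOT taken from the movie dataset
popularity = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
revenue = np.array([2.1, 2.9, 5.2, 4.8, 7.1, 6.9])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(popularity, revenue)[0, 1]

# R-squared computed from an ordinary least squares line fit
slope, intercept = np.polyfit(popularity, revenue, 1)
predicted = slope * popularity + intercept
ss_res = np.sum((revenue - predicted) ** 2)       # residual sum of squares
ss_tot = np.sum((revenue - revenue.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot

# The two quantities agree up to floating point error
print(round(r ** 2, 6), round(r_squared, 6))
```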
Did heavily budgeted movies turn out popular?
pop_budg = px.scatter(moviedata, x="popularity", y="budget", trendline="ols",
title="Relationship between Popularity and Budget")
pop_budg.update_layout(xaxis_title="Popularity",
yaxis_title="Budget")
pop_budg.show()
An R-squared value of 0.15 means that popularity explains only 15% of the variance in movie budget, so heavily budgeted movies did not reliably turn out popular.
What is the mean rating for all movies based on the voting averages?
mean_rating = moviedata.vote_average.mean()
mean_rating
6.18617574596222
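The mean rating of the movies in the dataset stands at approximately 6.2 on a scale of 10, so the typical movie is rated somewhat above the midpoint of the scale. Note that the mean alone does not tell us what share of movies viewers found satisfactory.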
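A mean rating and the share of movies above a given rating are different quantities. As a minimal sketch, assuming a threshold of 6 counts as "satisfactory" (an assumption made here for illustration only), the share can be computed directly. The ratings below are hypothetical values, not rows from the dataset.

```python
import pandas as pd

# Hypothetical ratings, NOT taken from the movie dataset
ratings = pd.Series([4.5, 5.8, 6.0, 6.3, 7.1, 7.9, 5.2, 6.6])

# Mean rating across all movies
mean_rating = ratings.mean()

# Share of movies rated at least 6 (the mean of a boolean mask is a proportion)
share_at_least_6 = (ratings >= 6).mean()

print(mean_rating, share_at_least_6)
```

On the real data the same expression would be `(moviedata['vote_average'] >= 6).mean()`.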
What is the minimum votes required to be in the top chart given a 95% percentile?
To do this, I will first filter out the movies that qualify for the chart based on the given percentile.
qual_movie = moviedata.query('vote_count >= vote_count.quantile(0.95)')
qual_movie.shape
(183, 17)
minvote = moviedata['vote_count'].quantile(0.95)
minvote
2337.199999999999
According to the output, 183 movies qualify for the top chart, i.e. they fall in the top 5% of movies by vote count. A movie needs at least approximately 2337 votes to make the list.
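The 95th percentile acts as a minimum cut-off rather than an average: every qualifying movie has at least that many votes. The sketch below shows the idea on a small, made-up vote series (the values are hypothetical); pandas interpolates linearly between neighbouring values by default, which is why the threshold need not be a value that occurs in the data.

```python
import pandas as pd

# Hypothetical vote counts, NOT taken from the movie dataset
votes = pd.Series([10, 25, 40, 80, 150, 300, 600, 1200, 2500, 5000],
                  name="vote_count")

# 95th percentile: the minimum number of votes needed to qualify
min_votes = votes.quantile(0.95)

# Movies at or above the cut-off
qualified = votes[votes >= min_votes]

print(min_votes, len(qualified))
```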
Time of Movie Release Analysis
How many movies were released in each month in the dataset?
month_ord = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
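def get_month(x):
    # Map the second '/'-separated field of the date string to a month abbreviation
    try:
        return month_ord[int(str(x).split('/')[1]) - 1]
    except (IndexError, ValueError):
        # Missing or malformed dates are recorded as NaN
        return np.nan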
moviedata['release_month'] = moviedata['release_date'].apply(get_month)
plt.figure(figsize=(15,10))
plt.title("Number of Movies released in a month.")
sns.countplot(x='release_month', data=moviedata, order=month_ord)
<AxesSubplot:title={'center':'Number of Movies released in a month.'}, xlabel='release_month', ylabel='count'>
November, December and January are the most popular months for movie releases in the dataset. Releases decline sharply in the subsequent months, with March and April being the least popular. The popularity of these months may be explained by their closeness to the Christmas and New Year celebrations.
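Extracting the month by splitting the date string depends on knowing which field holds the month. An alternative sketch is to parse the dates explicitly with `pd.to_datetime`. The month/day/two-digit-year format used below is an assumption about how the dates are stored, and the sample dates are hypothetical.

```python
import pandas as pd

month_ord = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
             'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical dates; the '%m/%d/%y' layout is an ASSUMPTION about the column
dates = pd.Series(['6/9/15', '12/25/14', '1/3/77', 'not a date'])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%m/%d/%y', errors='coerce')

# Translate the month number (1-12) into its abbreviation, leaving NaN untouched
release_month = parsed.dt.month.map(lambda m: month_ord[int(m) - 1],
                                    na_action='ignore')

print(release_month.tolist())
```

Stating the format explicitly makes it obvious which field is read as the month, and a wrong assumption shows up as NaT values rather than silently mislabelled months.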
Which month do movies generate the most revenue?
To answer this, I will look at movies that generated revenues above 100 million dollars.
month_mean = pd.DataFrame(moviedata[moviedata['revenue'] > 100000000].groupby('release_month')['revenue'].mean())
month_mean['mon'] = month_mean.index
plt.figure(figsize=(15,10))
plt.title("Average Gross Per Month for Blockbuster Movies")
sns.barplot(x='mon', y='revenue', data=month_mean, order=month_ord)
<AxesSubplot:title={'center':'Average Gross Per Month for Blockbuster Movies'}, xlabel='mon', ylabel='revenue'>
January recorded the highest average gross revenue for blockbuster movies, i.e. those that generated more than 100,000,000 dollars. Of all the months, April and August performed the worst.
There must be something particular about April for it to rank among the worst-performing months in terms of both releases and revenue generated.
Which month did movies record the highest profit rate?
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15, 10))
sns.boxplot(x='release_month', y='profit_rate', data=moviedata[moviedata['profit_rate'].notnull()], palette="muted", ax =ax, order=month_ord)
ax.set_ylim([0, 15])
(0.0, 15.0)
Again, January performed the best while April performed the worst. July, surprisingly, also performed well.
Did movie runtime increase over the years?
plt.figure(figsize=(15,10))
# groupby already skips missing years, so no extra filter is needed
year_runtime = moviedata.groupby('release_year')['runtime'].mean()
plt.plot(year_runtime.index, year_runtime)
plt.xticks(np.arange(1960, 2014, 10.0))
plt.xlabel('Release Year')
plt.ylabel('Run time')
plt.title('Run time over the years')
plt.show()
From the graph, the average movie runtime has trended downward over the years, suggesting that producers and directors have increasingly gone for shorter movies.
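A visual impression of a trend can be checked with a simple linear fit on the yearly averages: a negative slope indicates that runtimes shrink over time. The sketch below uses a tiny, hypothetical table in place of `moviedata`; the column names match the notebook but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data, NOT taken from the movie dataset
df = pd.DataFrame({
    'release_year': [1960, 1960, 1980, 1980, 2000, 2000, 2015, 2015],
    'runtime':      [130,  120,  118,  112,  110,  104,  108,  100],
})

# Average runtime per release year
year_runtime = df.groupby('release_year')['runtime'].mean()

# Least squares line through the yearly averages; slope is minutes per year
slope, intercept = np.polyfit(year_runtime.index.to_numpy(dtype=float),
                              year_runtime.to_numpy(), 1)

print(slope)
```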
What are the shortest movies ever produced?
moviedata[moviedata['runtime'] > 0][['runtime', 'title',
'release_year']].sort_values('runtime',ignore_index=True).head(10)
| | runtime | title | release_year |
|---|---|---|---|
| 0 | 26 | Mickey's Christmas Carol | 1983 |
| 1 | 63 | Winnie the Pooh | 2011 |
| 2 | 66 | 9 Songs | 2004 |
| 3 | 69 | The Land Before Time | 1988 |
| 4 | 72 | Return to Never Land | 2002 |
| 5 | 72 | The Jungle Book 2 | 2003 |
| 6 | 74 | The Great Mouse Detective | 1986 |
| 7 | 74 | Fantasia 2000 | 1999 |
| 8 | 75 | Justice League: The New Frontier | 2008 |
| 9 | 75 | Cats Don't Dance | 1997 |
Interestingly, each decade from the 1980s to the 2010s is represented in the list of the shortest movies in the dataset.
What are the longest movies ever produced?
moviedata[moviedata['runtime'] > 0][['runtime',
'title', 'release_year']].sort_values('runtime',
ascending=False,ignore_index=True).head(10)
| | runtime | title | release_year |
|---|---|---|---|
| 0 | 338 | Carlos | 2010 |
| 1 | 248 | Cleopatra | 1963 |
| 2 | 219 | Heaven's Gate | 1980 |
| 3 | 216 | Lawrence of Arabia | 1962 |
| 4 | 214 | Gods and Generals | 2003 |
| 5 | 213 | Jodhaa Akbar | 2008 |
| 6 | 202 | Malcolm X | 1992 |
| 7 | 201 | The Lord of the Rings: The Return of the King | 2003 |
| 8 | 200 | The Godfather: Part II | 1974 |
| 9 | 199 | The Greatest Story Ever Told | 1965 |
Even better, each decade from the 1960s to the 2010s is represented in the list of the longest movies in the dataset.
How correlated are the features measured in the dataset?
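# Restrict to numeric columns so the call also works on pandas versions
# that no longer drop non-numeric columns automatically
corr = moviedata.select_dtypes(include='number').corr()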
corr.style.background_gradient(cmap='coolwarm')
| | popularity | runtime | vote_count | vote_average | budget | revenue | profit_rate |
|---|---|---|---|---|---|---|---|
| popularity | 1.000000 | 0.211852 | 0.777181 | 0.319976 | 0.390374 | 0.540896 | -0.000759 |
| runtime | 0.211852 | 1.000000 | 0.273814 | 0.357021 | 0.334833 | 0.279330 | -0.034789 |
| vote_count | 0.777181 | 0.273814 | 1.000000 | 0.391628 | 0.492887 | 0.650684 | 0.004373 |
| vote_average | 0.319976 | 0.357021 | 0.391628 | 1.000000 | 0.029886 | 0.268006 | 0.007154 |
| budget | 0.390374 | 0.334833 | 0.492887 | 0.029886 | 1.000000 | 0.563692 | -0.031821 |
| revenue | 0.540896 | 0.279330 | 0.650684 | 0.268006 | 0.563692 | 1.000000 | 0.021838 |
| profit_rate | -0.000759 | -0.034789 | 0.004373 | 0.007154 | -0.031821 | 0.021838 | 1.000000 |
Interestingly, most of the features are positively correlated with one another. The most noteworthy positive correlations are popularity/vote_count (0.78) and vote_count/revenue (0.65), which are the strongest in the table.
Also, every feature is only very weakly correlated (positively or negatively) with profit rate. This suggests that factors beyond the features listed in the dataset play a role in how profitable a movie turns out to be.
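Reading the strongest pairs off a correlation matrix by eye gets harder as the number of features grows. One possible sketch is to list each unordered pair once and sort by coefficient. The small matrix below reuses three coefficients from the table above, rounded to two decimals; the variable names are illustrative.

```python
from itertools import combinations

import pandas as pd

cols = ['popularity', 'vote_count', 'revenue']

# Coefficients copied from the correlation table, rounded to two decimals
corr = pd.DataFrame(
    [[1.00, 0.78, 0.54],
     [0.78, 1.00, 0.65],
     [0.54, 0.65, 1.00]],
    index=cols, columns=cols,
)

# One entry per unordered pair of features, strongest correlation first
pairs = sorted(
    ((float(corr.loc[a, b]), a, b) for a, b in combinations(corr.columns, 2)),
    reverse=True,
)

for value, a, b in pairs:
    print(f"{a}/{b}: {value:.2f}")
```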
Computing the Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview:
#Importing TfIdfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
#Defining a TF-IDF Vectorizer Object. Removing all english stop words such as 'the', 'a'
tf_idf = TfidfVectorizer(stop_words='english')
#Replacing NaN with an empty string
moviedata['overview'] = moviedata['overview'].fillna('')
#Constructing the required TF-IDF matrix by fitting and transforming the data
tf_idf_matrix = tf_idf.fit_transform(moviedata['overview'])
#Outputting the shape of tf_idf_matrix
tf_idf_matrix.shape
(3653, 17904)
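The output means that 17,904 distinct terms (after removing stop words) were used to describe the 3,653 movies in the cleaned dataset. I will use cosine similarity to calculate a numeric quantity that denotes the similarity between two movies. Because TfidfVectorizer L2-normalizes each row by default, the dot product computed by linear_kernel equals the cosine similarity: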
# Importing linear_kernel
from sklearn.metrics.pairwise import linear_kernel
# Computing the cosine similarity matrix
cosine_sim = linear_kernel(tf_idf_matrix, tf_idf_matrix)
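Using `linear_kernel` (a plain dot product) in place of `cosine_similarity` relies on the TF-IDF rows having unit length, which is the default behaviour of `TfidfVectorizer` (`norm='l2'`). The sketch below checks this on three invented overview strings; the texts are hypothetical and only serve to illustrate the equivalence.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, NOT taken from the movie dataset
overviews = [
    "dinosaurs escape on a remote island park",
    "a theme park of cloned dinosaurs",
    "a team travels through a wormhole in space",
]

matrix = TfidfVectorizer(stop_words='english').fit_transform(overviews)

# Each row has unit L2 norm because of the vectorizer's default norm='l2'
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1))).ravel()

dot = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # explicit cosine similarity

print(np.allclose(row_norms, 1.0), np.allclose(dot, cos))
```

The first two overviews share the terms "dinosaurs" and "park", so their similarity is higher than that between the first and the third, which share no terms once stop words are removed.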
# Defining a function that takes in a movie title as an input and outputs a list of the 10 most similar movies:
#Construct a reverse map from movie titles to row positions. cosine_sim is indexed by position,
#so positions are used instead of moviedata.index, whose labels are not consecutive after cleaning
indices = pd.Series(np.arange(len(moviedata)), index=moviedata['title'])
#Keep the first occurrence of any repeated title so each lookup returns a single position
indices = indices[~indices.index.duplicated(keep='first')]
# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the row position of the movie that matches the title
    idx = indices[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 10 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:11]
    # Get the movie positions
    movie_indices = [i[0] for i in sim_scores]
    # Return the top 10 most similar movies
    return moviedata['title'].iloc[movie_indices]
# Testing the function to see movies with similar overviews for a given title:
get_recommendations('Jurassic World')
5391     The Lost World: Jurassic Park
10223    Jurassic Park
5741     Austenland
7996     Jaws 3
7995     National Lampoon's Vacation
85       Vacation
1582     Adventureland
5584     The Way Way Back
4467     Piranha 3DD
6666     Snow Cake
Name: title, dtype: object
get_recommendations('Interstellar')
3391     Bad Teacher
8734     Titan A.E.
10257    Manhattan Murder Mystery
6658     A Scanner Darkly
765      I, Frankenstein
6242     Elizabethtown
7654     You Kill Me
7126     Paparazzi
3577     Dream House
673      Into the Woods
Name: title, dtype: object
get_recommendations('Mad Max: Fury Road')
4072     Swept Away
6966     The Notebook
2006     Legion
10391    Xi yan
6663     A Good Year
7921     Paris, Texas
4008     Femme Fatale
7894     Dune
8704     The 6th Day
3026     The Other Man
Name: title, dtype: object
get_recommendations('Guardians of the Galaxy')
723      The Best of Me
4966     The Jungle Book 2
687      The Judge
2283     The Romantics
8460     Fargo
4571     Think Like a Man
2164     Tomorrow, When the War Began
2173     Devil
1724     The Joneses
5750     The Best Man Holiday
Name: title, dtype: object
get_recommendations('Insurgent')
6770     The Covenant
4635     Cosmopolis
1595     My Sister's Keeper
2672     Shallow Hal
10315    Menace II Society
2555     Music of the Heart
10022    Mo' Better Blues
10303    Beethoven's 2nd
111      Dark Places
7969     Ghoulies
Name: title, dtype: object
It would be interesting if the dataset included production location details, to further analyze how a specific location influences the success or failure of a movie release.